Damaged and immature specimens often result in macroinvertebrate data that contain ambiguous parent–child pairs (i.e., abundances associated with multiple related levels of the taxonomic hierarchy such as Baetis pluto and the associated ambiguous parent Baetis sp.). The choice of method used to resolve ambiguous parent–child pairs may have a very large effect on the characterization of invertebrate assemblages and the interpretation of responses to environmental change because very large proportions of taxa richness (73–78%) and abundance (79–91%) can be associated with ambiguous parents. To address this issue, we examined 16 variations of 4 basic methods for resolving ambiguous taxa: RPKC (remove parent, keep child), MCWP (merge child with parent), RPMC (remove parent or merge child with parent depending on their abundances), and DPAC (distribute parents among children). The choice of method strongly affected assemblage structure, assemblage characteristics (e.g., metrics), and the ability to detect responses along environmental (urbanization) gradients. All methods except MCWP produced acceptable results when used consistently within a study. However, the assemblage characteristics (e.g., values of assemblage metrics) differed widely depending on the method used, and data should not be combined unless the methods used to resolve ambiguous taxa are well documented and are known to be comparable. The suitability of the methods was evaluated and compared on the basis of 13 criteria that considered conservation of taxa richness and abundance, consistency among samples, methods, and studies, and effects on the interpretation of the data. Methods RPMC and DPAC had the highest suitability scores regardless of whether ambiguous taxa were resolved for each sample separately or for a group of samples. Method MCWP gave consistently poor results. Methods MCWP and DPAC approximate the use of family-level identifications and operational taxonomic units (OTUs), respectively.
Our results suggest that restricting identifications to the family level is not a good method of resolving ambiguous taxa, whereas generating OTUs works well provided that documentation issues are addressed.
Taxonomic ambiguities occur when organisms cannot be identified to a consistent taxonomic level (e.g., species) because they are damaged or immature or because the necessary taxonomic keys or expertise are not available. This inconsistency can lead to situations where abundances are reported at multiple levels of the taxonomic hierarchy associated with a taxon (Table 1). The data reported at the higher taxonomic levels (e.g., Baetidae and Baetis sp.) are ambiguous in the grouped data because they contain abundance information, but they add no new taxonomic information. The existence of the higher taxonomic levels already is implied by the existence of data at the lowest taxonomic level (e.g., Baetis flavistriga). The presence of ambiguous taxa is problematic because they introduce inconsistencies in the structure (richness and abundance) of the assemblages that should be resolved before the data are analyzed.
Table 1.
Hypothetical 3-sample data set in which the individual samples contain no ambiguous taxa, but the data set as a whole contains ambiguous taxa (grouped data). The status column identifies whether a taxon in the grouped data is an ambiguous parent (p), the child of an ambiguous parent (c), or both (p,c).
Resolving ambiguous taxa involves applying 1 of 3 basic decisions to each ambiguous parent–child pair: 1) remove the parent or child and its abundance from the data set, 2) merge the child with the parent, or 3) divide the abundance of the parent among ≥1 children, in effect creating operational taxonomic units (OTUs). How these decisions are applied to the data can radically alter the structure and characteristics (e.g., richness and abundance metrics) of the assemblages. For example, the taxonomic richness for the grouped data in Table 1 can be reported to be as little as 2, if all abundances are merged by adding them to the abundance of Baetidae and Hydropsychidae (decision 2), or as many as 10, if the ambiguous taxa Baetidae, Baetis sp., and Hydropsyche sp. are deemed distinct OTUs (decision 3). Similarly, the total abundance can be as little as 132 (i.e., delete all ambiguous taxa) or as great as 258 organisms (i.e., retain all ambiguous taxa or merge all abundances at the highest taxonomic level). These examples, although extreme, illustrate the potential inconsistencies that can occur when different methods are used to resolve taxonomic ambiguities within the same data set.
The effects of sample collection and processing methods (e.g., sampler type, mesh size of nets, subsampling methods, level of taxonomic identifications) on the characterization of invertebrate assemblages have received considerable attention in the literature (Jonasson 1958, Resh and Unzicker 1975, Elliott and Tullett 1978, Resh 1979, Furse et al. 1984, Storey and Pinder 1985, Cranston 1990, Clifford and Casey 1992, Kerans et al. 1992, Resh and McElravy 1993, Marchant et al. 1995, Brinkman and Duffy 1996, Hauer and Lamberti 1996, Vinson and Hawkins 1996, Carter and Resh 2001, Cao et al. 2002a, b). However, the effects of methods of resolving taxonomic ambiguities largely have been ignored or attention has been restricted to a discussion of the effects of taxonomic resolution (Bailey et al. 2001, Lenat and Resh 2001). In the preparation of this study, we encountered only one report (Taylor 1997) that specifically dealt with resolving ambiguous taxa. This absence of reports is a consequence, at least in part, of the fact that many ambiguous taxa are resolved during the process of data generation (i.e., identification and enumeration) when there is little perceived need for keeping a record of ambiguous taxa and how they are resolved. This lack of documentation is particularly problematic when combining data from multiple time periods, multiple laboratories, or multiple agencies because neither the personnel involved in resolving the ambiguous taxa nor the original data (i.e., data with ambiguous taxa) may be available. In consequence, the analyst cannot determine how these data compare and cannot reprocess the original data to produce comparable data sets. There is a real need to document the original data and the procedures used to resolve taxonomic ambiguities and to understand how methods of resolving taxonomic ambiguities affect assemblage characteristics and data comparability.
Toward that end, we examined how different methods of resolving ambiguous taxa affect the comparability of invertebrate assemblages, assemblage metrics (richness and abundance), and the interpretation of invertebrate responses. Our analyses are based on data collected by the US Geological Survey's (USGS) National Water-Quality Assessment (NAWQA) Program as part of the national urban streams program.
Methods
In the present paper, higher taxonomic levels refer to levels closer to phylum and lower levels are those closer to species. Ambiguous parents (e.g., Baetidae in Table 1) are taxa for which abundances also are reported at lower levels in the taxonomic hierarchy (e.g., B. flavistriga). The taxa at these lower taxonomic levels (e.g., B. flavistriga, B. intercalaris, and B. pluto) are children of the ambiguous parent (Baetidae). A taxon can be both an ambiguous parent and the child of an ambiguous parent if abundances are reported at higher and lower levels of the taxonomic hierarchy (e.g., Baetis sp. and Hydropsyche sp. in Table 1). A sample may contain one or more ambiguous parent–child pairings (grouped data) or none at all (samples 1–3). Even when the individual samples do not contain ambiguous taxa, the data set as a whole may contain ambiguous parents and children. For example, the samples presented in Table 1 contain no ambiguous taxa, but when considered as a group (grouped data) the data set contains 4 ambiguous parents (Baetidae, Baetis sp., Hydropsychidae, and Hydropsyche sp.) and 6 children. The consequence is that the determination of taxa richness for each individual sample in Table 1 is straightforward and noncontroversial, but estimating taxa richness for the entire data set is less so because ambiguous parents were counted as components of taxa richness in the individual samples. This consequence raises a variety of issues, such as whether taxa richness for the data set should be the total of all taxonomic entities in the individual samples (10) or only the nonambiguous entities (6) and whether consideration of only nonambiguous entities should be extended to the estimation of taxa richness for the individual samples to ensure consistency across the data set.
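The parent and child definitions above can be illustrated with a minimal Python sketch. The lineages and abundances below are hypothetical (they do not reproduce Table 1), and `classify` is an illustrative helper rather than part of any published software.

```python
# Hypothetical grouped data: each taxon is keyed by its lineage, from the
# highest reported level down to the level of identification.
grouped = {
    ("Baetidae",): 10,
    ("Baetidae", "Baetis sp."): 20,
    ("Baetidae", "Baetis sp.", "Baetis flavistriga"): 5,
    ("Baetidae", "Baetis sp.", "Baetis pluto"): 8,
    ("Hydropsychidae", "Cheumatopsyche sp."): 12,
}

def classify(data):
    """Return {lineage: status}, where status is 'p' (ambiguous parent),
    'c' (child of an ambiguous parent), 'p,c' (both), or '' (neither)."""
    status = {}
    for lineage in data:
        depth = len(lineage)
        # Ambiguous parent: some other reported taxon lies below this one.
        is_parent = any(len(o) > depth and o[:depth] == lineage for o in data)
        # Child: some other reported taxon lies above this one.
        is_child = any(len(o) < depth and lineage[:len(o)] == o for o in data)
        flags = (("p", is_parent), ("c", is_child))
        status[lineage] = ",".join(flag for flag, on in flags if on)
    return status

for lineage, flag in classify(grouped).items():
    print(lineage[-1], flag or "-")
```

In this toy data set Cheumatopsyche sp. is not ambiguous because no abundance is reported for Hydropsychidae itself; adding such a record would turn it into the child of an ambiguous parent.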
How these issues are handled can determine whether the data set can support the study objectives (e.g., characterization of richness or abundance metrics) or analytical methods (e.g., methods that rely on comparing assemblages: ordination, cluster analysis, discriminant analysis). To address these and other issues, the techniques for resolving ambiguous taxa include methods that resolve ambiguities in individual samples and methods that resolve ambiguities across a group of samples (i.e., grouped data).
Methods for resolving ambiguous taxa
The 16 methods for resolving taxonomic ambiguities that we examined (Table 2) are variations on 4 general methods that embody the decisions (i.e., remove, merge, or distribute taxa) that are commonly used when resolving ambiguous parent–child pairs. These general methods are: 1) remove parents, keep children (RPKC), 2) merge children with parent (MCWP), 3) remove parent or merge children with parent (RPMC), and 4) distribute parent among children (DPAC).
Table 2.
Descriptions of the methods used to resolve taxonomic ambiguities and the abbreviations that identify each method. The basic methods are identified by 4-character abbreviations. Variants of each method are identified by a 1- or 2-character suffix. See the Appendix for detailed descriptions of each method. S = single, G = grouped, F = family, O = order, P = phylum, C = conservative, K = knowledge based, L = liberal.
These methods resolve ambiguities by considering all ambiguous parent–child pairs starting with genus (ambiguous parent) and species (children) and progressing through family (F) and order (O) up to class and phylum (P). These methods can be used to resolve taxonomic ambiguities separately for each sample (S) or collectively for a group of samples (G). Several of these methods (MCWP-S, RPKC-G, MCWP-G, and DPAC-G) have additional variations that address unique properties of the method (e.g., conservative [C], knowledge based [K], and liberal [L]; Table 2). Details on how these 16 methods resolve ambiguous taxa are provided in the Appendix and in Cuffney (2003).
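As a rough illustration of the 4 general methods, the following Python sketch applies each one to a single ambiguous parent and its children within one sample. It is a simplification under stated assumptions, not the IDAS implementation: the abundances are hypothetical, DPAC is assumed to distribute the parent's abundance in proportion to the children's abundances, and RPMC is assumed to merge when the parent's abundance exceeds the children's combined abundance and to remove the parent otherwise. The variants in Table 2 are not modeled.

```python
def rpkc(parent, children):
    """Remove parent, keep children: the parent's abundance is discarded."""
    return dict(children)

def mcwp(parent, children):
    """Merge children with parent: all abundance is reported at the parent."""
    name, abundance = parent
    return {name: abundance + sum(children.values())}

def rpmc(parent, children):
    """Remove parent or merge children, depending on abundances.
    Assumed rule: merge if the parent outnumbers the children combined."""
    if parent[1] > sum(children.values()):
        return mcwp(parent, children)
    return rpkc(parent, children)

def dpac(parent, children):
    """Distribute parent among children.
    Assumed rule: shares are proportional to each child's abundance."""
    total = sum(children.values())
    return {k: v + parent[1] * v / total for k, v in children.items()}

# Hypothetical parent-child group from one sample.
parent = ("Baetis sp.", 30)
children = {"Baetis flavistriga": 10, "Baetis pluto": 30}

for method in (rpkc, mcwp, rpmc, dpac):
    result = method(parent, children)
    print(method.__name__.upper(), len(result), sum(result.values()), result)
```

With these numbers RPKC keeps 2 taxa but only 40 of the 70 organisms, MCWP keeps all 70 organisms but only 1 taxon, and DPAC keeps both, which mirrors the trade-offs between richness and abundance that the methods are designed to manage.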
The methods of resolving ambiguous taxa described here are computationally complex and not amenable to hand calculation when dealing with large numbers of samples and taxa. Consequently, computer software (Invertebrate Data Analysis System [IDAS]) was developed to automate the process of resolving taxonomic ambiguities and to provide other tools for processing and analyzing invertebrate data. A detailed explanation of this software and examples of its use in resolving taxonomic ambiguities and processing invertebrate data are given in Cuffney (2003). Data were processed with version 3.7.5 of IDAS. Terrestrial adults were removed from the data, aquatic life stages were combined, abundances were converted to densities (/m²), the lowest taxonomic level was set to species, and the data included OTUs, i.e., provisional and conditional identifications (Moulton et al. 2000).
Invertebrate data sets
Data from 4 urban stream studies in the NAWQA Program (Boston, Massachusetts [BOS]; Raleigh, North Carolina [RAL]; Birmingham, Alabama [BIR]; and Salt Lake City, Utah [SLC]) were used to assess the effects of different methods of resolving ambiguous taxa on the characterization, analysis, and interpretation of responses along gradients of urban intensity. These 4 studies are part of an ongoing program that compares biological, chemical, and physical responses along gradients of urban intensity in major metropolitan areas across the USA. The intensity of urbanization is defined by a multimetric urban intensity index (UII) derived from a combination of land-cover, land-use, infrastructure, population, and socioeconomic variables (McMahon and Cuffney 2000, Coles et al. 2004, Cuffney et al. 2005, Tate et al. 2005). These studies were conducted by using a common study design (Coles et al. 2004, Cuffney et al. 2005, Tate et al. 2005) and common sampling (Cuffney et al. 1993 for BIR, BOS, SLC; Moulton et al. 2002 for RAL) and processing protocols (Moulton et al. 2000). All invertebrate samples were processed by the USGS National Water-Quality Laboratory (NWQL) in Denver, Colorado.
Assessing the effects of resolving ambiguous taxa
The effects of the use of different methods to resolve ambiguous taxa were evaluated by 2 approaches. The 1st approach examined changes to the structure of the assemblages (i.e., similarity among assemblages), the values of assemblage metrics, and the relations among sites (i.e., the degree to which the structure of the original data is preserved). The 2nd approach evaluated effects on the interpretation of invertebrate responses to urbanization, an environmental disturbance that strongly degrades invertebrate assemblages (Paul and Meyer 2001, Coles et al. 2004, Carter and Fend 2005, Cuffney et al. 2005). Together, these 2 approaches address important issues related to data comparability, consistency, and interpretation.
Effects on assemblage structure were tested by a 2-stage similarity analysis (2STAGE, Primer 6; PRIMER-E, Plymouth, UK). The 16 methods for resolving ambiguous taxa were applied to the original data (ORIG) for each of the 4 urban studies (BIR, BOS, RAL, SLC) to produce a site-by-taxa matrix for each method and each study ([16 methods + ORIG] × 4 studies = 68 data matrices). Abundance data in each site-by-taxa matrix were √(x) transformed to reduce the influence of extreme values and a site-by-site similarity matrix was calculated by Bray–Curtis similarity. Spearman rank correlations were then calculated between all pairs of similarity matrices representing ORIG and the 16 methods for resolving taxonomic ambiguities. This produced a method-by-method (17 × 17) correlation matrix that was used in a 2nd-stage nonmetric multidimensional scaling (NMDS) plot to give a separate graphical representation of the similarities among methods for each study (BIR, BOS, RAL, and SLC). A second 2STAGE similarity analysis was used to determine how closely the method-by-method correlation matrices derived for each study resembled one another. This comparison was accomplished by using the method-by-method correlation matrices derived for each study as input into the 2STAGE analysis to obtain a correlation matrix that represented the correlation among studies based on the correlation among methods.
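The first stage of this analysis can be sketched in plain Python as follows. This is an illustration of the steps described above (square-root transformation, Bray-Curtis similarity among sites, Spearman rank correlation between methods), not a reproduction of the PRIMER routine. The site-by-taxa matrices are hypothetical, every site is assumed to contain at least 1 organism, and the NMDS step is omitted.

```python
from itertools import combinations
from math import sqrt

def bray_curtis_similarity(a, b):
    """Bray-Curtis similarity (0 to 1) between two abundance vectors."""
    difference = sum(abs(x - y) for x, y in zip(a, b))
    total = sum(x + y for x, y in zip(a, b))
    return 1.0 - difference / total

def site_similarities(matrix):
    """Similarities for all site pairs after a square-root transformation."""
    rooted = [[sqrt(x) for x in row] for row in matrix]
    pairs = combinations(range(len(rooted)), 2)
    return [bray_curtis_similarity(rooted[i], rooted[j]) for i, j in pairs]

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def second_stage(matrices):
    """Correlation between the site-similarity vectors of each method pair."""
    sims = {name: site_similarities(m) for name, m in matrices.items()}
    return {(a, b): spearman(sims[a], sims[b]) for a, b in combinations(sims, 2)}

# Hypothetical site-by-taxa matrices (4 sites) for ORIG and two methods.
# Column counts differ because each method retains a different set of taxa.
matrices = {
    "ORIG": [[4, 9, 1, 0], [1, 16, 0, 4], [0, 0, 9, 25], [0, 1, 4, 16]],
    "merged": [[13, 1], [17, 4], [0, 34], [1, 20]],
    "removed": [[9, 1, 0], [16, 0, 4], [0, 9, 25], [1, 4, 16]],
}
for pair, rho in second_stage(matrices).items():
    print(pair, round(rho, 3))
```

Because each method is compared through its own site-by-site similarities, the matrices do not need to share the same taxa, only the same sites in the same order.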
The effects of resolving taxonomic ambiguities on assemblage metrics were investigated for 34 metrics (Table 3) commonly used in bioassessment studies (Barbour et al. 1999). Metrics derived from ORIG, which contained unresolved ambiguous taxa (e.g., Baetidae, Baetis sp., and B. pluto are 3 taxa in ORIG), were compared to metrics derived from assemblages created by applying each of the 16 methods to ORIG. Comparisons were based on the value of the metric expressed as a percentage of the value for ORIG. For example, the metric Ephemeroptera + Plecoptera + Trichoptera richness (EPTr) comparisons would be expressed as % ORIG = ([EPTr from RPMC-S]/[EPTr from ORIG] × 100) and as the correlation (ρ) with ORIG (e.g., correlation between EPTr from RPMC-S and EPTr from ORIG). Separate analyses were conducted for each of the 4 studies. The % ORIG measures the extent to which the method changes the value of the metric relative to ORIG while minimizing the differences in metric values among sites. In contrast, the correlation with ORIG measures how consistently the method changes the value of the metric across all sites in a study. A large difference in % ORIG between methods indicates that the methods do not generate comparable results. Methods that show weak correlations with ORIG or high variability (i.e., coefficient of variation [CV] of the correlations derived for all metrics for all 4 studies) do not operate in a consistent manner across all sites and studies. Consequently, differences among sites may be an artifact of the method used to resolve ambiguous taxa rather than responses to environmental changes.
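The % ORIG calculation and the CV can be made concrete with a short Python sketch. The EPTr values and correlations below are hypothetical, and the use of the sample standard deviation in the CV is an assumption, because the text does not state which form was used.

```python
from statistics import mean, stdev

def percent_of_orig(method_values, orig_values):
    """Per-site metric value expressed as a percentage of the ORIG value."""
    return [100.0 * m / o for m, o in zip(method_values, orig_values)]

def coefficient_of_variation(values):
    """CV (%) = sample standard deviation / mean x 100 (assumed form)."""
    return 100.0 * stdev(values) / mean(values)

# Hypothetical EPT richness at 4 sites, before and after resolving ambiguities.
eptr_orig = [20, 16, 10, 8]
eptr_rpmc_s = [15, 12, 8, 6]

pct = percent_of_orig(eptr_rpmc_s, eptr_orig)
print(pct, mean(pct))

# Hypothetical correlations with ORIG for several metrics and studies.
correlations = [0.90, 0.95, 0.99, 0.97]
print(round(coefficient_of_variation(correlations), 1))
```

A method whose % ORIG is far from 100 but whose correlations with ORIG are strong and have a low CV changes the magnitude of a metric without changing the pattern among sites.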
Table 3.
Assemblage metrics used to assess the effects of methods for resolving taxonomic ambiguities.
Effects on the interpretation of responses along urban gradients were investigated by indirect gradient analysis (Gauch 1982) and correlation analysis. Indirect gradient analysis uses the correlation between ordination site scores and urban intensity to determine how strongly changes in urbanization are associated with changes in assemblages (Coles et al. 2004, Cuffney et al. 2005). Correspondence analysis (CA) was used to obtain ordination site scores along the primary ordination axis (CANOCO, version 4.5; Microcomputer Power, Ithaca, New York). These scores represent ecological distances among sites based on differences in assemblage structure. Abundance data (/m²) were √(x) transformed to reduce the influence of extreme values and rare taxa were downweighted to prevent them from distorting the ordination (Hill 1979). Correlation (Spearman rank) analysis was used to assess the strength of the association between UII, ordination site scores, and each of the 34 assemblage metrics derived for each method. Spearman rank correlations (ρ) were calculated by Systat 9.0 (SPSS, Chicago, Illinois). Consistency in the correlations between metrics and UII was used to assess consistency among methods by correlating (ρ) the correlations between metrics and UII obtained from a particular method (e.g., RPMC-S) with the correlations between UII and the corresponding metrics obtained from ORIG (i.e., the correlations with UII for the 17 metrics were correlated with the corresponding correlations with UII derived from ORIG). Consistency was evaluated for each of the 4 studies. Strong correlations indicate that the method operated consistently over all samples and metrics and methodological differences should not be a factor in interpreting the response to environmental changes (i.e., urbanization).
Results
Ambiguous parents in ORIG constituted a substantial proportion of taxa richness (22–26%) and abundance (28–37%) with relatively low variability among samples and studies (Fig. 1). Children of ambiguous parents constituted 76% (range: 73–78%) of taxa richness and 83% (range: 79–91%) of abundance at the phylum level (Fig. 2) based on the average of the total number and total abundance of children that were associated with ambiguous parents in each study. A small percentage of taxa richness (14%) and abundance (8%) were associated with ambiguous parents at the genus level. Most ambiguous taxa (74% of taxa richness, 72% of abundance) were associated with identifications at the level of tribe, subfamily, or family. Chironomidae constituted ~½ (43% of taxa richness, 41% of abundance) of the ambiguous taxa (Fig. 2) that occurred at this intermediate level, somewhat more than their contribution to total taxa richness (35%) and abundance (28%). Ambiguous taxa above the level of family constituted only 12% of taxa richness and 20% of abundance.
Effects on assemblage similarity
The study-by-study correlation matrix derived by using 2-stage similarity analysis of the method-by-method correlation matrices for each study area showed a very strong correlation (ρ ≥ 0.92) among study areas with low variability in the correlations among methods for the 4 studies (average CV = 4.2%, range 0.1–10.6%). The high correlations and low CVs indicate that the effects of the methods of resolving taxonomic ambiguities are consistent among the 4 studies. Therefore, we used the average correlation values for the 4 studies as the input data for the 2nd-stage NMDS to simplify the presentation of this information. The resulting ordination showed that the assemblages produced by method MCWP-S and its variants (-SF, -SO, -SP) were very similar to one another, but they were so different from other assemblages that they obscured the relations among the other methods. For this reason, the analyses were repeated after removing the MCWP-S methods.
The revised ordination (Fig. 3) clearly differentiated among the assemblages produced by the remaining methods. Assemblages produced by variants (-C, -K, -L) of method DPAC-G were very similar to one another and to method RPMC-G and ORIG. Methods that conserved total abundance at the expense of large proportions of taxa richness (MCWP-GF, -GO, -GP) were grouped together on the positive side of NMDS axis 1. Methods that tended to remove large amounts of taxa richness and abundance (RPKC-S, RPKC-G) were grouped on the negative side of NMDS axis 1, whereas methods that conserved both richness and abundance (DPAC-S, DPAC-G, RPMC-S, RPMC-G, ORIG) tended to be located near the middle of this axis. Methods that resolved ambiguities on the basis of grouped data (DPAC-G, RPMC-G, RPKC-G) tended to produce assemblages that were more similar to ORIG than were the assemblages produced by the equivalent methods that resolved ambiguities separately for each sample (DPAC-S, RPMC-S, RPKC-S). The -G and -S variants of each method were typically distant from one another; the -G variants occupied a relatively narrow range near the middle of NMDS axis 2 and the -S variants lay farther out along this axis. The 2nd-stage NMDS clearly established that the different methods and variants result in assemblages with different structures and similarities.
Effects on assemblage metrics
The method used to resolve taxonomic ambiguities had a large effect on the percentage of the ORIG taxa richness and abundance that was retained in the processed data (Fig. 4). Some methods retained little of the taxa richness (e.g., 35% for MCWP-GP) and abundance (e.g., 64% for RPKC-S and RPKC-G), whereas others retained 100% of the abundance and as much as 92% of taxa richness (DPAC-GL). Though the percentages of the ORIG taxa richness and abundance retained by each method differed somewhat among the 4 studies, the degree to which each method changed taxa richness and abundance relative to ORIG was very consistent among the studies (i.e., ρ ≥ 0.98 with ORIG).
Methods that merge children with their ambiguous parents (MCWP-S, MCWP-G) retained the least taxa richness (35–55%) and the most abundance (99–100%; Fig. 4). Attempts to mitigate the loss of taxa richness by deleting ambiguous parents above the family (MCWP-SF, -GF) or order (MCWP-SO, -GO) level generally were not successful. They did little to mitigate the loss when samples were processed separately (MCWP-SF, -SO) and had only a modest effect when samples were combined (i.e., MCWP-GF retained more taxa richness than did MCWP-GO or MCWP-GP). The very small differences in the total abundance (<1%) retained by variants -SF, -SO, -GF, -GO show that only a small proportion of abundance is associated with ambiguous parents above the level of family.
Method RPKC retained much more of the taxa richness in ORIG than did MCWP-S and about as much as the equivalent variants of methods RPMC (-S, -G) and DPAC (-S, -GC, -GK, -GL) (Fig. 4). However, method RPKC retained the least amount of abundance (64–72%) of any method because the abundances of ambiguous parents or children were not conserved as they were with the other methods (MCWP, RPMC, DPAC).
Method DPAC and its variants retained both a high percentage of abundance (100%) and taxa richness (74–92%) (Fig. 4). Taxa richness values derived by using methods DPAC-S and DPAC-GC were identical to those derived by using methods RPKC-S and RPKC-GC, but the assemblages produced by these methods were not. The percentages of taxa richness retained by using methods DPAC-GK and DPAC-GL were greater than for other variants of method DPAC (-S, -GC) because the -GK and -GL variants create OTUs that can inflate taxa richness. The same is true for the -GK and -GL variants of RPKC.
Method RPMC gave results (richness and abundance) that were intermediate to those obtained by using methods RPKC and DPAC (Fig. 4). It retained slightly less taxa richness (71–78%) than did methods RPKC and DPAC (74–92%) and an intermediate level of abundance (83–94%) compared to RPKC (64–72%) and DPAC (100%). This result was expected because RPMC resolves ambiguous parent–child pairs by applying either method RPKC or MCWP depending upon whether the parent's abundance is greater than the collective abundance of the children.
The values of other richness (Table 4) and abundance (Table 5) metrics also were affected strongly by the method used to resolve taxonomic ambiguities. Some methods (MCWP-GF, -GO, and -GP) eliminated the taxonomic groups required to calculate the metric (e.g., ORTHOr, ORTHO_CHr, TANYr, TANY_CHr in Table 4; ORTHO, ORTHO_CH, TANY, TANY_CH in Table 5; see Table 3 for metric abbreviations). Others, particularly metrics based on ratios, reached values that were multiples of the value in ORIG. This pattern was most evident for richness metrics (e.g., EPT_CHr with method MCWP, TANY_CHr with methods RPKC, RPMC, and DPAC; Table 4), though there were similar examples among the abundance metrics (e.g., TANY_CH with methods RPKC, RPMC, and DPAC; Table 5). Most of the richness (13 of 16) and abundance (11 of 16) metrics followed the pattern previously described for total richness (RICH in Table 4) or abundance (ABUND in Table 5) as indicated by a strong correlation (|ρ| ≥ 0.8) with RICH or ABUND. However, only 8 (EPT, EPHEM, PLECO, TRICHO, DIP, CH, NCDIP, and ODIPNI) of the 16 metrics were strongly correlated with both RICH and ABUND. The strong influence of the method used to resolve ambiguous taxa on the values of metrics suggests that metrics from different sources should be combined only if there is a clear understanding of the comparability of the methods used to resolve ambiguous taxa.
Table 4.
Richness metrics derived using each of the 16 methods for resolving taxonomic ambiguities and expressed as the average percentage of the value in the original data (ORIG) for the 4 urban studies. Metrics marked with an asterisk (*) were strongly correlated (|ρ| ≥ 0.8) with total taxa richness (RICH). Abbreviations for methods and variants of resolving ambiguous taxa are given in Table 2. Abbreviations for assemblage metrics are given in Table 3. NC = metric could not be calculated.
Table 5.
Abundance metrics derived using each of the 16 methods for resolving taxonomic ambiguities and expressed as the average percentage of the value in the original data (ORIG) for the 4 urban studies. Metrics marked with an asterisk (*) were strongly correlated (|ρ| ≥ 0.8) with total abundance (ABUND). Abbreviations for methods and variants of resolving ambiguous taxa are given in Table 2. Abbreviations for assemblage metrics are given in Table 3. NC = metric could not be calculated.
Although the method used to resolve ambiguous taxa had a strong effect on the values of metrics, these effects were consistent among methods as evidenced by strong average correlations (ρ ≥ 0.88) between metrics derived from ORIG and metrics derived from all methods except MCWP (Table 6). Average correlations for richness metrics derived by using method MCWP generally were much lower (0.53–0.82) than for other methods and correlations for some metrics could not be calculated (e.g., ORTHOr, ORTHO_CHr, TANYr, TANY_CHr) because the method (MCWP-GF, -GO, -GP) eliminated the taxa required to calculate the metric. Correlations for abundance metrics derived by using method MCWP were much higher than the equivalent richness metric. Variations of MCWP that delete ambiguous parents above the family (-SF, -GF) and order levels (-SO, -GO) did not produce stronger correlations with ORIG or mitigate problems with the calculation of some metrics. Correlations between metrics derived from the -S variants of MCWP and the ORIG data were much lower than those for the -G variants indicating that the -G variants retained more of the structure of the original data than did the -S variants.
Table 6.
Summary of the correlations (ρ) between the value of the metric derived from the original data and assemblages derived from the 16 methods of resolving ambiguous taxa. Average correlation and CV are based on a total of 68 observations (i.e., 17 metrics × 4 studies) for each method. Abbreviations for methods and variants for resolving ambiguous taxa are given in Table 2.
The CVs for the correlations with metrics derived from ORIG were low (<5%) for richness metrics (Table 6) derived from all methods except MCWP (23–75%) and the -L variants of RPKC (13.5%) and DPAC (13.8%). In contrast, the CVs for the correlations of abundance metrics were low for methods DPAC (-S, -GC, -GK, and -GL: <1%), RPMC (-S: 1.4%, -G: 1.3%), and MCWP-G (<7%), moderate for methods RPKC-S and -G (15.3%), and high for method MCWP-S (45.3%). The CVs indicate a considerable difference among methods in the degree of consistency associated with richness and abundance metrics.
The strong correlations between the original and processed data are significant because they established that, on average, all methods with the exception of MCWP preserved the pattern of changes among sites that was present in ORIG even though the method could substantially change the values of the metrics. The CVs indicate that, despite the high degree of correlation, there were instances in which there was considerable variation (i.e., inconsistency) in the correlations with ORIG, particularly in the estimation of abundance metrics. Consequently, both the strength of the correlations with ORIG and the potential CV associated with that correlation must be considered when evaluating the appropriate method for resolving ambiguous taxa and generating richness and abundance metrics. Methods RPMC-S, RPMC-G, DPAC-S, DPAC-GC, and DPAC-GK offer the best combination of correspondence with the original data and consistency among metrics and study areas.
Methods that assign the abundance of ambiguous parents to children on the basis of combined data (RPKC-G and DPAC-G) can produce higher values for taxa richness metrics than existed before the taxonomic ambiguities were resolved (e.g., PLECOr; Table 4). This pattern was particularly true for the -L variants of these methods (RPKC-GL and DPAC-GL) that substitute all possible children for parents and greatly inflate the number of taxa that compose the metric. The liberal approach is unrealistic, but it also is possible for the -K variants (RPKC-GK and DPAC-GK) to produce data sets that have higher taxa richness than the original data. This result occurs when the ambiguous parent is replaced with ≥2 children because the best available information indicates that multiple children should occur at the site. It is important to keep in mind that methods RPKC-G and DPAC-G eliminate all ambiguous taxa in the data set and preserve abundance, but these methods require the analyst to estimate the occurrences of taxa (children) in samples that contain ambiguous parents but no children. In this regard, methods RPKC-G and DPAC-G mimic the application of OTUs, which also estimate the identity of taxa that were not in the original data. However, methods RPKC-G and DPAC-G restrict the identity of these taxa to what is already in the data set (not necessarily the case with other methods of deriving OTUs).
Effects on detecting responses to urbanization
The strengths of the correlations between UII and assemblage metrics derived by using methods RPKC, RPMC, and DPAC were very similar (ρ ≥ 0.94) to the correlations between UII and metrics derived from ORIG (Table 7). This result indicates that these methods (RPKC, RPMC, DPAC) preserve the relations among samples that existed in ORIG. Correlations between UII and metrics derived by using method MCWP and its variants were not always similar to the correlation between UII and metrics derived from ORIG. That is, method MCWP did not always preserve the relations among samples that existed in ORIG. The ranges between the best and worst correlations with UII all were associated with differences between metrics derived by using variations of method MCWP and those derived by using methods RPKC, RPMC, or DPAC. Resolving ambiguous taxa separately (-S) or for a group of samples (-G) had little effect on the correlation between the metric and urban intensity for methods RPKC, RPMC, and DPAC. However, resolving ambiguous taxa for a group of samples (-G) did improve the correspondence between MCWP and ORIG, particularly for richness metrics derived by using the -GF variant (MCWP-GF).
Table 7.
Summary of the consistency in the correlations between urban intensity (UII) and metrics derived from the 16 methods of resolving ambiguous taxa. Consistency is expressed as the correspondence (ρ) between the correlations with UII obtained from each method and from the original data (ORIG) (i.e., the correlations with UII for the 17 metrics are correlated with the corresponding correlations with UII derived from ORIG). The maximum and minimum correlations are based on the correlation obtained for each of the 4 studies. The number of observations that were strongly (|ρ| ≥ 0.70) correlated with UII (ρUII) and strongly correlated with both UII and ORIG (ρUII ORIG) are based on a total of 68 observations (17 metrics × 4 study areas) for each method. Abbreviations for methods of resolving ambiguous taxa are explained in Table 2.
The number of richness metrics that were strongly correlated (|ρ| ≥ 0.7) with UII was highly consistent (17–18 metrics) among methods RPKC, RPMC, and DPAC (ρUII in Table 7) as were the number (12–13) strongly correlated with both UII and ORIG (ρUII ORIG in Table 7). In contrast, method MCWP, with the exception of MCWP-GF, had fewer richness metrics that were strongly correlated with UII (8–12) and relatively few of these metrics (6–9) also were strongly correlated with ORIG. The numbers of strongly correlated metrics obtained with method MCWP-GF (16 for UII, 11 for both UII and ORIG) were much more similar to those of the other methods than were the numbers obtained with the -SF, -SO, -GO, and -GP variants of MCWP. Fewer abundance metrics were strongly correlated with UII (2–3) than were richness metrics and most (2–3) of these metrics were the same metrics that were strongly correlated with UII for the original data (ORIG). The number of metrics that were strongly correlated with UII was relatively small compared to the number of metrics that were considered (i.e., 17 metrics × 4 studies = 68).
With the exception of the MCWP methods, the ability to detect responses to urbanization by using metrics (i.e., |ρ| ≥ 0.7) was not sensitive to the method used to resolve taxonomic ambiguities. Compared to the other methods, the relations with UII obtained by using MCWP were inconsistent among metrics and studies, although method MCWP-GF approached the levels obtained with other methods. This inconsistency suggests that method MCWP should not be used to detect or interpret responses.
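The consistency measure summarized in Table 7 can be illustrated with a minimal sketch. It assumes that each method yields one correlation with UII per metric, that consistency is the rank correlation between those values and the corresponding values from ORIG, and that a metric counts toward ρUII ORIG when it is strong (|ρ| ≥ 0.7) for both the method and ORIG. The function names and the five example values are hypothetical and are not taken from the study data.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks).

    Not defined when either input is constant (division by zero).
    """
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def count_strong(rho_method, rho_orig, cutoff=0.7):
    """Count metrics strong for the method, and strong for both method and ORIG."""
    strong_uii = sum(abs(r) >= cutoff for r in rho_method)
    strong_both = sum(
        abs(a) >= cutoff and abs(b) >= cutoff
        for a, b in zip(rho_method, rho_orig)
    )
    return strong_uii, strong_both


# Hypothetical correlations with UII for five metrics (illustration only).
rho_orig = [-0.85, -0.72, 0.40, 0.78, -0.10]
rho_method = [-0.80, -0.65, 0.35, 0.81, -0.05]

consistency = spearman(rho_method, rho_orig)
strong_uii, strong_both = count_strong(rho_method, rho_orig)
```

In this hypothetical case the method preserves the ordering of the metrics exactly, yet one metric that is strong in ORIG falls below the 0.7 cutoff, so both views are worth reporting.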
All resolution methods except MCWP-S showed similarly strong correlations between UII and the primary ordination axis (CA axis 1) site scores (Fig. 5) with eigenvalues for the first axes (0.2–0.3) indicating a relatively long gradient. The strength of the correlations derived by using method MCWP-S and its variants (-F, -O, and -P) varied among studies and tended to be less than the other methods. The -G variants (MCWP-G) were not as variable as the -S variants and were comparable to the other methods. Detecting responses to urbanization by using assemblage ordination was less sensitive to the choice of method for resolving ambiguous taxa than were responses by using assemblage metrics because all methods except MCWP-S gave consistent results.
Discussion
Ambiguous taxa are a common feature of invertebrate data sets because of problems with the identification of immature or damaged specimens or the presence of species not covered by available taxonomic keys. Even a few organisms identified at relatively high taxonomic levels (e.g., order or family) can, depending upon the method used to resolve ambiguous taxa, affect a very large proportion of the taxa in a data set (e.g., 73–78% in our data sets). The assemblages and assemblage characteristics (e.g., metrics) produced by the different methods may not be comparable because the methods used to resolve taxonomic ambiguities differ in how they retain, delete, or combine taxa.
The lack of comparability among methods is a critical issue because combining data generated by using incomparable methods can lead to situations where the differences in assemblages or assemblage metrics arise from methodological rather than environmental causes and the resulting interpretation of the data may be erroneous. The errors that can be introduced by combining data generated by incomparable methods of resolving ambiguous taxa (Fig. 4, Tables 4, 5) can be large and probably are on a par with errors introduced by using incomparable methods of data collection and processing. However, there is little discussion of this problem in the literature (Taylor 1997) compared to other sample-collection and processing issues (see compilations by Resh 1979, Resh and McElravy 1993, Carter and Resh 2001).
The methods used to resolve ambiguous taxa must be evaluated carefully before data sets can be combined. Thorough documentation of the process used to resolve ambiguous taxa is essential for determining whether data are comparable. Unfortunately, the procedures used to resolve ambiguous taxa generally are not well documented and discussion of this topic generally is restricted to specifying the desired taxonomic level for identifications (e.g., Rosenberg and Resh 1993, Barbour et al. 1999, Bailey et al. 2001, Lenat and Resh 2001, Carter and Fend 2005, Kreutzweiser et al. 2005, NCDENR 2006) rather than to documenting how ambiguous parent–child pairs are resolved. This level of discussion does not provide sufficient information to assess adequately whether assemblages have been processed by using comparable methods. Even if the methods of resolving ambiguous taxa are carefully documented, it may not be possible to modify assemblages so that they are comparable. For example, assemblages processed by using methods MCWP and DPAC cannot be reprocessed to produce comparable assemblages because the required taxonomic information has been removed from the data sets. One means of addressing this issue is to provide uncensored data sets (i.e., the data sets with ambiguous taxa) as part of the documentation process (Taylor 1997). This practice would make it possible to form comparable data sets by processing the uncensored data sets by using a common method of resolving ambiguous taxa. Both the US Environmental Protection Agency's Environmental Monitoring and Assessment Program (EMAP) (http://www.epa.gov/emap/html/data.html) and the US Geological Survey's NAWQA Program (http://water.usgs.gov/nawqa/data) provide uncensored data that allow users to resolve ambiguous taxa by using methods that generate data sets that are comparable with other data sets.
The issue of data comparability is not as critical when the objective of the analysis is to understand the relative differences among sites or samples as opposed to generating comparable assemblages and assemblage characteristics (i.e., metrics). All methods except MCWP resulted in assemblages that preserved the relative differences among sites even though the assemblages and assemblage metrics differed among methods. As long as the method (RPKC, RPMC, DPAC) was used consistently within the study, the choice of method was not critical to the interpretation of the change in assemblages among sites or samples (e.g., response along the urban gradient) because all methods except MCWP produced assemblages that were highly correlated with ORIG (i.e., the data with ambiguous taxa) and with each other.
Selecting the appropriate method for resolving ambiguous taxa
The different methods for resolving ambiguous taxa were evaluated by comparing each method against 13 criteria (Table 8) that defined a hypothetical ideal method. The ideal method should eliminate ambiguous taxa from the entire data set, conserve a high degree of the taxa richness and abundance in the original data, and should not require estimating the presence or abundances of missing taxa. It should not overestimate taxa richness metrics when compared to the original data, and it should not preclude calculation of metrics by eliminating taxonomic groups. It should act in a consistent fashion across all samples and preserve or enhance the differences in assemblage structure that existed among samples in the original data (i.e., it should be highly correlated with the original data and exhibit low variability among studies), and it should preserve or enhance the response to environmental changes across sites (e.g., correlation with UII).
Table 8.
Summary of the 13 criteria used to define the ideal method for resolving taxonomic ambiguities and to evaluate the suitability of the 16 methods used to resolve ambiguous taxa. The highest possible suitability index value is 21. Abbreviations for methods and variants for resolving ambiguous taxa are explained in Table 2. Y = yes, N = no, H = high, L = low, M = medium, O = overestimate, UII = urban intensity index.
The ability to conserve taxa richness and abundance was evaluated on the basis of the relations among methods shown in Fig. 4. The degree to which taxa richness was conserved was rated as high (H) if the value of taxa richness was close to that produced by method RPKC, low (L) if it was substantially less (e.g., MCWP-GP), medium (M) if it fell between low and high, and overestimate (O) if it was substantially greater than method RPKC (e.g., DPAC-GL). Method RPKC-S was used as the basis of comparison for taxa richness because this method preserves the maximum amount of taxa richness without resorting to the estimation of new taxa from ambiguous taxa (i.e., OTUs). The degree to which abundance was conserved was based on the percentage of total abundance in ORIG that each method retained: H (98–100%), M (85–97%), and L (<85%). The similarity among assemblages created by each method was assessed on the basis of the method-by-method correlation matrices developed for each study during the 2-stage similarity analyses. Each method was rated on the basis of the number of methods that were highly correlated (|ρ| ≥ 0.8) with it (16 methods × 4 studies = 64 possible comparisons: H ≥ 40, 20 < M < 40, L ≤ 20). The degree of consistency among samples was evaluated on the basis of the CV for the correlation of richness or abundance metrics derived for each method and ORIG (H ≤ 10%, 10% < M ≤ 20%, L > 20%; Table 6). The ability to detect responses to urbanization was determined by the number of urban studies where the CA axis 1 site scores were strongly correlated (|ρ| ≥ 0.7) with UII (H = 3 or 4, M = 2 or 3, L = 0 or 1; Fig. 5) and by the sum of the number of richness metrics that were strongly correlated (|ρ| ≥ 0.7) with UII and the number that also were strongly correlated with UII on the basis of ORIG (H ≥ 27, 20 < M < 27, or L ≤ 20 metrics; Table 7). 
The number of abundance metrics that were strongly correlated with UII was low and did not vary much among methods (Table 7) so this performance characteristic was not considered. Deviations from the ideal method were scored and summed to produce a suitability index. Yes/No responses were scored as 1 if the response matched that of the ideal method or 0 if it did not. High, medium, and low responses were scored as 2, 1, and 0, respectively, except for the conservation of richness, which was scored as 3 (H), 2 (M), 1 (L), and 0 (O).
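The scoring rules described above can be sketched as follows. This is a minimal illustration and not the procedure used to build Table 8. The function names are hypothetical, the treatment of abundance values that fall between the stated classes (e.g., 97.5%) is an assumption, and the split into 6 yes/no criteria and 6 high/medium/low criteria plus richness is inferred from the maximum score of 21 rather than stated in the text.

```python
# Points awarded for each type of response, following the rules in the text.
HML_POINTS = {"H": 2, "M": 1, "L": 0}
RICHNESS_POINTS = {"H": 3, "M": 2, "L": 1, "O": 0}  # O = overestimate


def rate_abundance(pct_retained):
    """Rate conservation of abundance from the percentage of ORIG retained.

    Assumption: values between the published classes are assigned to the
    lower class (e.g., 97.5% is rated M).
    """
    if pct_retained >= 98:
        return "H"
    if pct_retained >= 85:
        return "M"
    return "L"


def rate_cv(cv_pct):
    """Rate consistency among samples from the CV (%) of correlations with ORIG."""
    if cv_pct <= 10:
        return "H"
    if cv_pct <= 20:
        return "M"
    return "L"


def suitability_index(yes_no_matches, hml_ratings, richness_rating):
    """Sum the points for one method.

    yes_no_matches: booleans, True where the method matches the ideal method.
    hml_ratings: "H"/"M"/"L" ratings for the criteria scored 2/1/0.
    richness_rating: "H"/"M"/"L"/"O" rating for conservation of taxa richness.
    """
    yes_no_score = sum(1 for match in yes_no_matches if match)
    hml_score = sum(HML_POINTS[r] for r in hml_ratings)
    return yes_no_score + hml_score + RICHNESS_POINTS[richness_rating]


# A hypothetical ideal method reaches the maximum of 21.
ideal_score = suitability_index([True] * 6, ["H"] * 6, "H")

# A hypothetical method that misses one yes/no criterion and rates M once.
example_score = suitability_index(
    [True, True, True, False, True, True],
    ["H", "H", "M", "H", "H", "H"],
    "H",
)
```

Weighting richness on a 0 to 3 scale means that overestimating richness costs a method more than a low rating on any other single criterion.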
Method RPMC-G had the highest suitability index with a score of 20 out of 21. Methods RPMC-S, DPAC-S, and DPAC-GC were close seconds with scores of 19. Methods RPKC-S, RPKC-GC, and MCWP-GF also had relatively good scores (15–17). The -S variants of method MCWP had the lowest scores (8). The -G variants of method MCWP, particularly the -GF variant, had scores that were substantially higher (13–15) than MCWP-S, but that were still much lower than other methods. Removing ambiguous parents above the level of family (-F) improved the score of method MCWP-G, but not MCWP-S. The -K and -L variants of methods RPKC and DPAC had less desirable characteristics and lower scores than did the -C variants largely because they tended to overestimate taxa richness.
On the basis of these criteria, methods RPMC (-S, -G) and DPAC (-S, -GC) are the most appropriate for resolving ambiguous taxa for analyses involving quantitative data. However, the final choice should be based on a detailed review of the characteristics of each method (Table 8) and how they relate to the aspects of assemblage structure that are important to the study objectives and analysis methods. For example, if the analytical method requires taxonomic consistency within and among samples (e.g., to provide consistent assemblages for ordination analysis) then RPMC-G and DPAC-GC would be the most appropriate methods. DPAC-GC would be preferred if the user wanted to include OTUs in the assemblages and RPMC-G would be preferred if the user wanted to exclude OTUs. On the other hand, if the study objectives are focused on maximizing taxa richness within each sample, then either method RPMC-S or DPAC-S would be appropriate depending on whether or not the user wanted to include OTUs.
The scores used to evaluate suitability in Table 8 are specific to quantitative analyses, that is, methods that preserve both taxa richness and abundance are rated higher than those that conserve taxa richness at the expense of conserving abundance. Consequently, if the analyses are based solely on qualitative (i.e., presence/absence of taxa) information, then another method with a much lower score might be more appropriate. For example, methods RPKC-S or RPKC-G would be appropriate for qualitative analyses because the abundance information carried by the ambiguous parents would not be relevant to the analysis and the taxonomic information that they contained would already be implied by the presence of the children of the ambiguous parents. The choice of RPKC-S or -G would depend upon the desire to include or exclude OTUs.
Correspondence to family-level identification and OTUs
Methods MCWP-SF and -GF closely approximated the process of minimizing ambiguous taxa by restricting identifications to the family level (i.e., 2-stage similarity analysis relating these methods to family-level identifications, lowest taxonomic level set to family in IDAS, for BIR, BOS, RAL, and SLC: ρ > 0.99 for MCWP-GF and 0.70–0.76 for -SF). Restricting identifications to the family level has been advocated as a means of reducing variability introduced by taxonomic ambiguities (Bailey et al. 2001), although others (Lenat and Resh 2001) have made the case for the value of including more detailed taxonomic information. Our results show that resolving ambiguous taxa by restricting identifications to the family level (MCWP-SF and -GF) can radically alter the structure of assemblages and the values of the assemblage metrics, affect data comparability, compromise the ability to detect responses to environmental stressors (e.g., UII), and introduce variability among studies that is not encountered with other methods. For these reasons, we suggest using other methods to resolve ambiguous taxa (e.g., RPMC or DPAC) that have less effect on the underlying data, produce results that are more consistent and comparable among studies and methods, and provide more taxonomic information.
Method DPAC mimics the process of reducing ambiguous taxa by assigning ambiguous parents to OTUs. The variants of method DPAC that we examined (-S, -GC, -GK, and -GL) limit OTUs to taxa that exist in the original data set. Even with this restriction, method DPAC can inflate estimates of taxa richness and alter assemblage structure compared to other methods, although consistency among samples and relations with environmental factors (i.e., UII) are not strongly affected. Despite the possibility of overestimating taxa richness, the suitability indices for method DPAC were high (18–20) for all variants except -GL (14), which grossly overestimated richness relative to other methods. Despite the high scores for method DPAC, we consider estimation of missing taxa to be an undesirable characteristic of a method for resolving ambiguous taxa. In large part, our reservation toward this approach is associated with the difficulty of ensuring that OTUs are used consistently within a group of samples and the general lack of information available to support the identity of OTUs (i.e., the morphological characteristics that support the OTUs are not described). This lack of information can create problems when combining data from different sources because it is generally very difficult or impossible to determine if OTUs (e.g., Baetis sp. 1) refer to the same morphologic group in each data set. These problems can be addressed by naming OTUs on the basis of their affinity with a described taxon (e.g., Centroptilum/Procloeon sp., Stenonema modestum/smithae, Hydropsyche sp. nr. elissoma) or by maintaining a reference collection to support the OTUs. However, affinities are only approximations that can change over time and reference collections are difficult and expensive to maintain and to share with diverse groups. 
Thus, although OTUs can work well for individual studies, it would be better, as with other methods of resolving ambiguous taxa, to make uncensored data available also so that data can be processed to match other data sets or to accommodate changes in taxonomy.
Separate estimation of taxa richness and abundance
Our comparison of methods for resolving ambiguous taxa has focused on removing ambiguities from quantitative samples and then characterizing taxa richness and abundance characteristics for the resulting assemblage. This approach is a compromise between retaining the maximum taxa richness and the maximum abundance in the original data. Alternatively, different methods of resolving ambiguous taxa could possibly be used for characterizing richness and abundance attributes in the data. For example, richness metrics could be calculated from assemblages derived by method RPKC-S, which maximizes retention of taxa richness without OTUs, and abundance metrics could be estimated by method DPAC-S, which maximizes retention of abundance. This type of approach is problematic because it decouples estimation of the richness and abundance characteristics from the underlying assemblage data. For this reason, we take the position that decoupling the derivation of richness and abundance metrics by using different methods to resolve ambiguous taxa is not appropriate and should be avoided. There are methods (e.g., RPMC and DPAC) that do a good job of preserving both taxa richness and abundance, so it should not be necessary to resort to decoupling the estimation of taxa richness and abundance.
Acknowledgments
We thank Doug Harned of the USGS for his unflagging support of this work. Bob Ourso and Jim Carter of the USGS contributed constructive comments on early drafts of the article. We thank all of the landowners who have so generously allowed us access to their property while conducting our urban stream studies. Elise Giddings, Humbert Zappia, and Jim Coles of the USGS generously provided data. Many USGS biologists contributed their time and energy to testing and evaluating the methods of resolving ambiguous taxa that are incorporated in the IDAS software. We greatly appreciate the timely, comprehensive, and thoughtful comments from John Van Sickle of the USEPA, Corvallis, Oregon, and 2 anonymous referees. Their comments have greatly improved the quality of the article. This work was supported by the US Geological Survey's National Water-Quality Assessment Program.
Literature Cited
Appendices
Appendix. Explanation of methods used to resolve ambiguous taxa
The methods of resolving ambiguous taxa are based on variations of 4 methods: 1) remove parent, keep child (RPKC), 2) merge child with parent (MCWP), 3) remove parent or merge child depending on their abundances (RPMC), and 4) distribute parents among children (DPAC). Ambiguous parent–child pairs are resolved either separately for each sample (-S variants) or for a group of samples (-G variants) starting with species–genus and progressing up to class–phylum. The rules for resolving ambiguities (i.e., which taxa to remove, merge, or distribute) for -G variants are derived by applying the equivalent -S method to the grouped sample data (Table A1) and then applying these rules to each sample in the data set. Data sets processed by using the -G variants are free of all ambiguous taxa. Samples processed by using the -S variants are free of ambiguous taxa, but ambiguous taxa may still be present when the data are considered as a group. The examples of resolving ambiguous taxa have been simplified by using the sum of samples 1 to 4 (Table A1) to represent both the grouped data and a single sample (Sample A; Table A1). Consequently, the -S variant examples illustrate both how ambiguities are resolved for a single sample (Sample A) and how the rules for resolving ambiguous taxa are defined for a group of samples (grouped data).
Table A1.
Assemblages produced by resolving ambiguous taxa using method RPKC (remove parent, keep children). The single (-S) variant resolves ambiguous taxa for sample A and defines how ambiguous taxa are resolved in the grouped data formed by combining samples 1 to 4. The conservative (-C), knowledge-based (-K), and liberal (-L) options of grouped (-G) variants resolve ambiguous taxa for sample 2 on the basis of the rules defined by method RPKC-S.
RPKC
This method removes ambiguous parents and their abundance from the data set while keeping the children of the ambiguous parents (Table A1). The -S variant (RPKC-S) simply removes all taxa identified as ambiguous parents. The -G variants apply the RPKC-S method to the grouped data to determine which parents to remove and then apply these decisions to each of the individual samples (samples 1 to 4). A sample being resolved by using the -G variant may contain an ambiguous parent that is to be removed (e.g., Acentrella sp. in Sample 2; Table A1) but no children of that parent (A. parvula and A. turbida). In these cases, the parent's abundance is assigned to one or more children of the parent (method RPKC-G; Table A1) by using 1 of 3 approaches: conservative (-C) assigns the parent's abundance to the child that occurred at the most sites in the data set; knowledge based (-K) distributes the parent's abundance among one or more children on the basis of knowledge of the taxa that occur at similar sites; and liberal (-L) distributes the parent's abundance among all children associated with the ambiguous parent in the grouped data. The parent's abundance is distributed among the children in proportion to the relative abundance of the children in the grouped data. For example, RPKC-GL distributes the abundance of Acentrella sp. in sample 2 among the 2 children, A. parvula and A. turbida, in proportion to their abundance in the grouped data, 85.5% and 14.5%, respectively. In the -C option (RPKC-GC), the entire abundance of Acentrella sp. (30) is assigned to A. parvula because it occurs at 3 of the 4 sites, whereas A. turbida occurs at only 2 sites. In the -K option (RPKC-GK), the entire abundance of Acentrella sp. is assigned to A. turbida because this species was found more commonly than A. parvula at sites that were similar to Sample 2.
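The three options for assigning a parent's abundance under RPKC-G can be sketched as shown below. This is an illustrative sketch and not the IDAS implementation. The function name `assign_parent` and its arguments are hypothetical, and the grouped abundances (171 and 29) are invented so that the shares match the 85.5% and 14.5% given in the text. The parent abundance (30) and the site counts (3 and 2) come from the example above.

```python
def assign_parent(parent_abundance, grouped_children, option,
                  site_counts=None, known_children=None):
    """Assign an ambiguous parent's abundance to children absent from a sample.

    grouped_children: {child: abundance in the grouped data}
    option: "C" (conservative), "K" (knowledge based), or "L" (liberal)
    site_counts: {child: number of sites where it occurs}, needed for "C"
    known_children: children chosen by the analyst, needed for "K"
    """
    if option == "C":
        # Entire abundance goes to the child found at the most sites.
        best = max(site_counts, key=site_counts.get)
        return {best: float(parent_abundance)}
    if option == "K":
        targets = {c: grouped_children[c] for c in known_children}
    elif option == "L":
        targets = dict(grouped_children)
    else:
        raise ValueError("option must be 'C', 'K', or 'L'")
    # Distribute in proportion to abundance in the grouped data.
    total = sum(targets.values())
    return {c: parent_abundance * a / total for c, a in targets.items()}


# Hypothetical grouped abundances giving shares of 85.5% and 14.5%.
grouped = {"A. parvula": 171, "A. turbida": 29}
sites = {"A. parvula": 3, "A. turbida": 2}

conservative = assign_parent(30, grouped, "C", site_counts=sites)
knowledge = assign_parent(30, grouped, "K", known_children=["A. turbida"])
liberal = assign_parent(30, grouped, "L")
```

All three options conserve the parent's abundance of 30. They differ only in how many children are added to the sample, which is why the -L option tends to inflate taxa richness.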
MCWP
This method adds the abundances of the children to that of their ambiguous parent and then removes the children from the data set for both the -S and -G variants (Table A2). The presence of taxa identified at high taxonomic levels (e.g., family, order, or class) can lead to huge losses in taxa richness when lower taxonomic levels are merged into the highest level in the taxonomic hierarchy that has abundance information. Three variations that remove ambiguous parents occurring above the level of family (-F), order (-O), or phylum (-P) before resolving ambiguities were used to alleviate this problem. The process begins by eliminating ambiguous parents that occur above the specified taxonomic levels (i.e., eliminate Zygoptera, Ephemeroptera, and Insecta for family [F], Insecta for order [O], and no taxa for the phylum [P] option because there are no taxonomic levels higher than phylum) and then resolving the remaining ambiguous parent–child pairings. Ambiguous taxa at the family level (Baetidae) caused abundances to be totaled at the family level for method MCWP-SF. Specimens that could be identified only to suborder (Zygoptera) and order (Ephemeroptera) caused all abundances to be added into Ephemeroptera and Zygoptera after eliminating data above the order level (i.e., Insecta) for method MCWP-SO. The presence of 6 specimens that could be identified only to the class level (Insecta) caused all abundance data (760) to be added into Insecta for method MCWP-SP.
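The effect of the -F, -O, and -P options can be seen in a small sketch. The code below is a simplified illustration and not the IDAS implementation. The function name `mcwp`, the data layout, and most of the abundances are hypothetical. It assumes that each taxon carries its list of ancestors and that an ambiguous parent is any taxon in the sample that appears among the ancestors of another taxon in the sample.

```python
RANKS = ["species", "genus", "family", "suborder", "order", "class", "phylum"]
LEVEL = {rank: i for i, rank in enumerate(RANKS)}


def mcwp(sample, cutoff):
    """Merge children with parents after removing parents above `cutoff`.

    sample: {name: (rank, lineage, abundance)}, where lineage lists the
            ancestors of the taxon from lowest to highest level.
    cutoff: "family", "order", or "phylum" (the -F, -O, and -P options).
    """
    # Ambiguous parents: taxa in the sample that are ancestors of other taxa.
    parents = {anc for (_, lineage, _) in sample.values()
               for anc in lineage if anc in sample}
    # Step 1: remove ambiguous parents (and their abundance) above the cutoff.
    kept = {name: rec for name, rec in sample.items()
            if not (name in parents and LEVEL[rec[0]] > LEVEL[cutoff])}
    kept_parents = {p for p in parents if p in kept}
    # Step 2: merge every taxon into its highest remaining ambiguous parent.
    merged = {}
    for name, (_, lineage, abundance) in kept.items():
        target = name
        for anc in lineage:  # lowest to highest, so the last match wins
            if anc in kept_parents:
                target = anc
        merged[target] = merged.get(target, 0) + abundance
    return merged


insecta = ("Insecta",)
mayfly = ("Ephemeroptera",) + insecta
# Hypothetical sample; abundances are for illustration only.
sample = {
    "Baetis pluto": ("species", ("Baetis sp.", "Baetidae") + mayfly, 40),
    "Baetis sp.": ("genus", ("Baetidae",) + mayfly, 10),
    "Baetidae": ("family", mayfly, 12),
    "Ephemeroptera": ("order", insecta, 100),
    "Argia sp.": ("genus", ("Coenagrionidae", "Zygoptera", "Odonata") + insecta, 20),
    "Zygoptera": ("suborder", ("Odonata",) + insecta, 10),
    "Insecta": ("class", (), 6),
}

family_option = mcwp(sample, "family")
order_option = mcwp(sample, "order")
phylum_option = mcwp(sample, "phylum")
```

With these hypothetical data the -F option keeps two taxa but discards 116 of 198 organisms, whereas the -P option keeps every organism but collapses the sample into a single taxon.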
Table A2.
Assemblages produced by resolving ambiguous taxa using method MCWP (merge children with parents). The family (-F), order (-O), and phylum (-P) single (-S) variants resolve ambiguous taxa for sample A and define how ambiguous taxa are resolved in the grouped data formed by combining samples 1 to 4. The grouped (-G) variants resolve ambiguous taxa for sample 3 on the basis of the rules defined by the corresponding variant of method MCWP-S.
The -G variants of method MCWP (Table A2) apply the appropriate variant of method MCWP-S (-SF, -SO, or -SP) to the grouped data to define the rules for removing ambiguous parents and merging ambiguous parent–child pairs. For example, MCWP-SF removed the ambiguous parents Insecta, Ephemeroptera, and Zygoptera and merged all mayflies into the family Baetidae. This results in the removal of the 100 organisms that were identified to Ephemeroptera in sample 3, the assignment of 66 organisms (Acentrella sp., A. parvula, Baetis sp.) to Baetidae, even though this sample contained no organisms identified to this level, and the removal of 10 organisms identified to Zygoptera. The -O variant merged all the mayflies into Ephemeroptera, but kept the 10 organisms identified to Zygoptera. The -P variant merged all data into the Insecta.
RPMC
This method resolves ambiguous taxa by applying either method RPKC or MCWP to each ambiguous parent–child pair (Table A3). If the collective abundance of the children is greater than the abundance of the ambiguous parent, method RPKC is used; otherwise method MCWP is used. The abundances that were eliminated in previous ambiguous parent–child pairings are considered part of the total abundance of the children in subsequent pairings. Table A4 illustrates the application of RPMC-G through 3 iterations. The 1st iteration resolves the genus–tribe pairing by using method RPKC to eliminate the ambiguous parent Tanytarsini because its abundance (5) is less than that of the children (65). The eliminated abundance (5) is added to the carry over row for use in the next pairing (tribe–subfamily). This carry over ensures that subsequent comparisons of ambiguous parent–child abundances reflect all the children associated with the parent. For example, the tribe–subfamily pairing compares the abundance of the ambiguous parent Chironominae (225) to that of its children (200) and the amount carried over from previous pairings (5). The collective abundance of the children (205) is less than the parent (225) so method MCWP is used to merge the children with the parent (Chironominae = 430) and the abundance of Tanytarsini (5) is eliminated from the carry over row because it is now part of the abundance reported for Chironominae. In contrast, the abundance of Diamesinae (10) is less than that of its children (100) so Diamesinae is eliminated (RPKC) from the data and added to the carry over row (10) for consideration in the subfamily–family pairing. 
In the final pairing (subfamily–family), the abundance of the ambiguous parent Chironomidae (100) is less than the combined abundance of the children (530) and the carry over (10), so the abundance of the parent is eliminated (RPKC) from the data set and added to the carry over value, which becomes 110 and would be considered in the next parent–child pairing (family–order).
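The decision rule and the carry over can be checked with a short sketch that replays the abundances quoted above from Table A4. The function `resolve_pair` is a hypothetical helper written for this illustration. It handles one parent–child pairing at a time and assumes that the caller passes in only the carry over that belongs to the parent being resolved.

```python
def resolve_pair(parent, children, carry=0):
    """Resolve one ambiguous parent-child pairing under method RPMC.

    parent: abundance of the ambiguous parent
    children: combined abundance of the children still in the data
    carry: abundance eliminated in earlier pairings below this parent
    Returns (method used, parent abundance after resolution, new carry over).
    """
    collective = children + carry
    if collective > parent:
        # RPKC: remove the parent; its abundance joins the carry over.
        return "RPKC", 0, carry + parent
    # MCWP: merge the children and the carry over into the parent.
    return "MCWP", parent + collective, 0


# 1st iteration (genus-tribe): Tanytarsini (5) vs. children (65).
tanytarsini = resolve_pair(5, 65)

# 2nd iteration (tribe-subfamily).
chironominae = resolve_pair(225, 200, carry=tanytarsini[2])
diamesinae = resolve_pair(10, 100)

# 3rd iteration (subfamily-family): Chironomidae (100) vs. children (530)
# plus the carry over left by Diamesinae (10).
chironomidae = resolve_pair(100, 530, carry=diamesinae[2])
```

The results reproduce the values in the text: Chironominae becomes 430 by merging, and the carry over after the final pairing is 110.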
Table A3.
Assemblages produced by resolving ambiguous taxa using methods RPMC (remove parent or merge children with parent) and DPAC (distribute parent among children). The single (-S) variants resolve ambiguous taxa for sample A and define how ambiguous taxa are resolved in the grouped data formed by combining samples 1 to 4. The conservative (-C), knowledge-based (-K), and liberal (-L) options of grouped (-G) variants resolve ambiguous taxa for sample A on the basis of the rules defined by the method RPMC-S or DPAC-S.
Table A4.
An example of iterative resolution of ambiguous parent–child pairs in a progression from genus to family using the grouped (-G) option of method RPMC (remove parent or merge children). This method considers the abundance of organisms that have been removed in previous iterations (carry over) when comparing ambiguous parent–child abundances in subsequent iterations.
The rules used for the -G variant (RPMC-G) are derived by applying RPMC-S (Table A3) to the grouped data. When these rules are applied to sample 3, the Ephemeroptera, Acentrella sp., and Baetis sp. are removed from the sample even though the abundance of Ephemeroptera in sample 3 (100) is greater than that of its children (66). This removal occurs because the abundance of the Ephemeroptera in the grouped data (100) is much less than that of the children (536) so Ephemeroptera is deleted from the grouped data and all samples. Baetis sp. is removed for the same reason even though it is more abundant (35) in sample 3 than are the children (0). The removal of Acentrella sp. and the retention of Zygoptera also occur by the rules identified for the grouped data.
DPAC
This method distributes the abundance of the ambiguous parent among the children in proportion to the relative abundance of each child in the sample (Table A3). If more than one child is associated with the ambiguous parent, the parent's abundance is divided among the children in proportion to the relative abundance of each child. Table A5 illustrates how abundances of parents are distributed among their children for method DPAC-S (Table A3). The first ambiguous parent–child pairs occur at the genus–species level (Table A5) where the abundances of the ambiguous parents (Acentrella sp. and Baetis sp.) are divided among their respective children in proportion to the abundance of the children (e.g., 85.5% of Acentrella sp. is added to A. parvula and 14.5% is added to A. turbida). The next ambiguous parent–child pairings occur at the family–genus level where the abundance of Baetidae (12) is divided among the species of Acentrella and Baetis on the basis of their relative abundances after resolving ambiguous parents at the genus level (e.g., 20.9% would be assigned to A. parvula). Taxa richness decreases with each iteration, but the total abundance remains the same.
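The proportional distribution can be sketched over two levels of the hierarchy. This is a minimal illustration and not the IDAS implementation. The function name `distribute` is hypothetical, and the abundances are invented except for Baetidae (12). The Acentrella values are chosen so that the shares match the 85.5% and 14.5% quoted above.

```python
def distribute(parent_abundance, children):
    """Add a parent's abundance to its children in proportion to their abundance.

    children: {child: abundance}; returns the children with updated abundances.
    """
    total = sum(children.values())
    return {name: abundance + parent_abundance * abundance / total
            for name, abundance in children.items()}


# Genus-species level: each genus-level parent is divided among its species.
acentrella = distribute(30, {"A. parvula": 171, "A. turbida": 29})
baetis = distribute(10, {"Baetis pluto": 40})

# Family-genus level: Baetidae (12) is divided among all species of both
# genera, using the abundances obtained after the genus level was resolved.
species = {**acentrella, **baetis}
resolved = distribute(12, species)

original_total = 171 + 29 + 30 + 40 + 10 + 12
resolved_total = sum(resolved.values())
```

The six original taxa reduce to three species while the total abundance stays at 292, which illustrates the trade-off noted above: taxa richness decreases with each iteration, but abundance is conserved.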
Table A5.
An example of how the single (-S) variant of method DPAC (distribute parents among children) resolves ambiguous taxa in the original data (ORIG) starting with ambiguous parent–children pairs at the genus–species level and progressing to the class–order level. This method distributes the abundance of the ambiguous parents (p) among the children in accordance with the relative abundance of each child. The relative abundance of each child is determined on the basis of the distribution of abundances that occurred in the previous level. Boldface type indicates the abundances that were modified by the distribution of parent's abundances in each iteration from genus to class.
Use of method DPAC to resolve ambiguous taxa for a group of samples (DPAC-G) can lead to situations where the rules developed for the grouped data (DPAC-S; Table A3) call for dividing an ambiguous parent among its children, but none of the children are present in the sample (e.g., sample 2 in Table A3 contains the ambiguous parent Acentrella sp., but neither of the children, A. parvula and A. turbida). In these situations, we have elected to use the same approach as previously used for method RPKC-G. We pick the children over which the parent's abundance will be distributed on the basis of the children in the grouped data. This process generates new taxa that were not in the original sample (OTUs). The -C, -K, and -L approaches are the same as used for RPKC-G. The abundance of the ambiguous parent is distributed among the children in accordance with their relative abundance in the grouped data (Table A3).